Abstract

Local structures of shadow boundaries as well as complex
interactions of image regions remain largely unexploited by
previous shadow detection approaches. In this paper, we present a
novel learning-based framework for shadow region recovery from a
single image. We exploit the local structures of shadow edges by
using a structured CNN learning framework. We show that using the
structured label information in the classification can improve
the local consistency of the results and avoid spurious
labelling. We further propose and formulate a shadow/bright
measure to model the complex interactions among image regions.
The shadow and bright measures of each patch are computed from
the shadow edges detected in the image. Using the global
interaction constraints on patches, we formulate a least-square
optimization problem for shadow recovery that can be solved
efficiently. Our shadow recovery method achieves state-of-the-art
results on the major shadow benchmark databases collected under
various conditions.


1. Introduction 

Shadow detection has long been considered a crucial component of
scene interpretation. Shadows in an image provide useful
information about the scene: the object shapes [18], the relative
positions and 3D structures of the scene [2, 1], the camera
parameters and geo-location cues [11], and the characteristics of
the light sources [22, 14, 19]. However, they can also cause
great difficulties for many computer vision algorithms, such as
background subtraction, segmentation, tracking and object
recognition. Despite its importance and long tradition, shadow
detection remains an extremely challenging problem, particularly
from a single image. The main difficulty is due to the complex
interactions of geometry, albedo, and illumination in natural
scenes. Since shadows correspond to a variety of visual
phenomena, finding a unified approach to shadow detection is
difficult.

Motivated by this observation, recent papers [10, 25, 15, 8, 6,
12] have explored the use of learning techniques for shadow
detection. These approaches take an image patch and compute the
likelihood that the centre pixel contains a shadow edge. Such a
classifier is limited by its locality since it treats each pixel
independently. Optionally, the independent shadow predictions may
then be combined through global reasoning with a CRF/GBP/MRF
algorithm.

Shadow edges in a local patch are actually highly interdependent,
and exhibit common forms of local structures: straight lines,
corners, curves, parallel lines; while structures such as
T-junctions or Y-junctions are highly unlikely on shadow
boundaries [8, 15].

In this paper we propose a novel learning-based framework for
shadow detection from a single image. We exploit the local
structures of shadow edges by using a structured Convolutional
Neural Network (CNN) framework. The CNN learning framework is
designed to capture the local structure information of shadow
edges and automatically learn the most relevant features. We
formulate the problem of shadow edge detection as predicting
local shadow edge structures given input image patches. In
contrast to unary classification, we take structured labelling
information of the label neighbourhood into account. We show that
using the structured label information in the classification can
improve the local consistency of the results and avoid spurious
labelling.

We also propose a novel global shadow optimization framework. In
the previous learning approaches, a CRF/GBP/MRF algorithm is
usually employed for enforcing the local consistency over
neighbouring labels [25, 15, 12, 6] and the non-local constraints
of region pairs with the same materials [6]. The size of the
label images and the presence of loops in such an algorithm make
the expectation computations intractable. Moreover, the memory
requirement for loading all the training data is large and
parameter updating requires tremendous computing resources. Here,
we introduce novel shadow and bright measures to model the region
interactions based on the spatial layout of image regions. For
each image patch, a shadow and a bright measure are computed
according to its connectivities to all of the shadow and bright
boundaries in the image, respectively. The shadow/bright
boundaries are extracted from the shadow edges detected by the
proposed CNN. Using these shadow and bright measures, we
formulate a least-square optimization problem for shadow recovery
to solve for the shadow map (locations). Our optimization
framework combines the non-local cues of region interactions in a
straightforward and efficient manner where all constraints are in
linear form.

Figure 1: strCNN architecture used for learning the structure of
shadow edges. It takes a 28 × 28 color image patch as an input,
and outputs a $\{0,1\}^{25}$ vector which corresponds to the
shadow edge structure of the 5 × 5 central patch.

Experimental results on the major shadow benchmark 
databases demonstrate the effectiveness of the proposed 
technique. 

1.1. Related Work 

Early works on detecting shadows are motivated by physical models
of illumination and color [9, 21]. Finlayson et al. [5, 4]
located shadows by using a 1-D shadow-free image computed from a
single image. Their methods only work under the assumption of
approximately Planckian lighting, and high-quality input images
are needed.

To adapt to environment changes, statistical learning based
approaches [20, 17, 7, 10] have been developed to learn the
shadow model at each pixel from a video sequence. Recently, some
data-driven learning approaches have been developed for
single-image shadow detection. Lalonde et al. [15] detected cast
shadow edges on the ground with a classifier trained on local
features. Zhu et al. [25] proposed a similar approach for
monochromatic images. Every pixel is classified as either being
inside a shadow or not. The per-pixel outputs are inherently
noisy with poor contour continuity. To overcome this, the
predicted posteriors are usually fed to a CRF formulation which
defines the pairwise smoothness constraints across neighbouring
pixels. Guo et al. [6] modelled the long-range interactions using
the non-local cues of region pairs. They then incorporated the
non-local pairwise constraints into a graph-cut optimization.
Yago et al. [24] integrated the region, boundary and paired
regions classifiers using an MRF. All these learning approaches
employ hand-crafted features as input.

More recently, Khan et al. [12] proposed a deep learning
framework to automatically learn the features for shadow
detection. They showed that the CNN with learned features
outperforms the current state-of-the-art with hand-crafted
features. They trained a unary classifier where separate CNNs
learned features for detecting shadows at boundaries and uniform
regions, respectively. The per-pixel predictions are then fed to
a CRF for enforcing local consistency. In contrast to unary
classification, we predict the structure of shadow edges in a
local patch. Our work is inspired by the recent works on learning
structured labels for semantic image labelling [13] and edge
detection [16, 3] in Random Forests. Our aim is to explore
structured learning in CNNs for local shadow edge detection.

We are also inspired by the work on saliency estimation [26],
which proposes a background measure based on the boundary
connectivity prior that a salient region is less likely to be
connected to the image boundary. Utilizing the connectivity
definition, we derive our shadow measures to model the region
interactions based on the spatial layout of image regions.

2. Structured Deep Shadow Edge Detection 

In this section, we present our structured CNN learning 
framework, then we explain how to apply it to shadow edge 
detection. 

2.1. Structured Convolutional Neural Networks 

Convolutional neural networks can produce high-dimensional and
complex outputs, which makes structured output prediction
possible. We use $x \in \mathcal{X} = \mathbb{R}^{d \times d \times K}$
to denote a color image patch, and
$y \in \mathcal{Y} = \{0,1\}^{s \times s}$ to denote the target
structured label, where d and s indicate the patch widths of x
and y, respectively, and K is the number of channels. The
structured prediction can be expressed as a mapping
$C : \mathcal{X} \to \mathcal{Y}$ from an input domain
$\mathcal{X}$ to a structured output domain $\mathcal{Y}$. Here,
$C(\cdot)$ is the structured CNN.

Fig. 1 shows our 7-layer network architecture. Our learning
approach predicts a 5 × 5 structured label from a larger 28 × 28
image patch. The network consists of two alternating pairs of
convolutional and max-pooling layers, followed by a fully
connected layer and finally a logistic regression output layer
with a softmax nonlinearity. The first and second convolutional
layers consist of six and twelve 5 × 5 kernels, respectively,
with unit-pixel stride. The pooling size is 3 × 3 with unit-pixel
stride. The fully connected layer that follows has 64 hidden
units.
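
As a rough check of the layer sizes described above, the following sketch traces the feature-map widths through the network. It assumes unpadded ("valid") convolution and pooling, which the text does not state, and the function names are illustrative only.

```python
def valid_output_size(size, window, stride=1):
    """Width of a feature map after an unpadded sliding-window operation."""
    return (size - window) // stride + 1


def strcnn_feature_sizes(input_size=28, kernel=5, pool=3, maps=(6, 12)):
    """Trace (layer, number of maps, map width) through the conv/pool stages.

    Assumption: no padding anywhere; stride 1 for both convolution and
    pooling, as described in the text.
    """
    sizes = []
    size = input_size
    for n_maps in maps:
        size = valid_output_size(size, kernel)
        sizes.append(("conv", n_maps, size))
        size = valid_output_size(size, pool)
        sizes.append(("pool", n_maps, size))
    return sizes


def flattened_features(sizes):
    """Number of values fed to the fully connected layer."""
    _, n_maps, size = sizes[-1]
    return n_maps * size * size


if __name__ == "__main__":
    stages = strcnn_feature_sizes()
    for stage in stages:
        print(stage)
    print("inputs to the 64-unit layer:", flattened_features(stages))
```

Under these assumptions the map widths are 24, 22, 18 and 16, so the 64-unit fully connected layer would receive 12 × 16 × 16 = 3072 values and the output layer would emit 25 values for the 5 × 5 label.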

One challenge in training a CNN with structured labels is that
structured output spaces are high dimensional, which leads to
long training times. We show in the experiments that with a 5 × 5
structured output we can capture a wide variety of local shadow
structures while still keeping the training complexity low.

2.2. Structured Learning Shadow Edges 

We employ the proposed structured CNN for feature learning for
shadow edge detection. It takes a 28 × 28 color image patch as an
input, and outputs a $\{0,1\}^{25}$ vector which corresponds to
the 5 × 5 shadow probability map of the central patch.

Assume we have a set of images X with a corresponding set of
binary images Y of shadow edges. The structured CNN operates on
the patches at image edges: only the patches that contain image
edges in their 5 × 5 central area are used. The input patches are
extracted from X, and the corresponding groundtruth patches are
extracted from the binary images Y. We randomly sample N (< 800)
shadow and non-shadow edge patches, respectively, per image for
training. Before feeding the extracted patches to the CNN, the
data is zero-centered and normalized.

Specifically, we first apply the Canny edge detector to extract
all candidate edges, both shadow and non-shadow. Positive patches
are obtained from the pixels on Canny edges that coincide with
the groundtruth shadow edges. Likewise, negative (non-shadow)
input patches are obtained from the edge pixels that do not
overlap with the groundtruth shadow edges. In the experiments, as
the groundtruth is hand-labelled, the actual shadow edge may not
coincide with the groundtruth edges. We therefore usually dilate
the groundtruth edges before overlapping them with the Canny
edges. For the datasets with region-based groundtruth such as UCF
and UIUC, we extract the blob boundaries as the groundtruth of
shadow edges. As pointed out in [12], non-shadow pixels usually
outnumber shadow pixels by a ratio of approximately 6:1. We
address this class imbalance problem by setting the number of
positive samples as the upper bound of the number of negative
samples to be sampled. To reduce the number of redundant samples,
we only randomly sample the patches on a 3 × 3 grid. During the
training process, we use stochastic gradient descent.
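
The sampling procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the Canny and groundtruth edge maps are already available as 2-D lists of 0/1 values, interprets "a 3 × 3 grid" as keeping every third row and column, and uses a hypothetical dilation radius of one pixel.

```python
import random


def sample_edge_pixels(canny, gt_edges, n_max=800, grid=3, radius=1, seed=0):
    """Pick centres of positive (shadow) and negative (non-shadow) patches.

    canny, gt_edges: 2-D lists of 0/1 with the same shape.
    A Canny pixel is positive if a groundtruth shadow edge lies within
    `radius` pixels (equivalent to dilating the groundtruth), else negative.
    """
    height, width = len(canny), len(canny[0])

    def near_groundtruth(y, x):
        return any(
            gt_edges[yy][xx]
            for yy in range(max(0, y - radius), min(height, y + radius + 1))
            for xx in range(max(0, x - radius), min(width, x + radius + 1))
        )

    positives, negatives = [], []
    # Assumption: the sampling grid keeps every `grid`-th row and column.
    for y in range(0, height, grid):
        for x in range(0, width, grid):
            if canny[y][x]:
                (positives if near_groundtruth(y, x) else negatives).append((y, x))

    rng = random.Random(seed)
    rng.shuffle(positives)
    rng.shuffle(negatives)
    positives = positives[:n_max]
    # Class balance: never keep more negatives than positives.
    negatives = negatives[:min(n_max, len(positives))]
    return positives, negatives
```

Capping the negatives at the number of positives is one direct reading of "setting the number of positive samples as the upper bound"; other balancing schemes would fit the description equally well.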

Fig. 2 shows the shadow and non-shadow edges learned by the
proposed structured CNN. We can see that our CNN does capture the
local structures of shadow edges. Besides learning the
classification of shadow and non-shadow edges, the proposed
network also learns valid labelling transitions among adjacent
pixels (i.e. interactions among the pixels in a local patch).

Figure 2: (a) Shadow and (b) non-shadow patches learned by the
proposed structured CNN. Left: input 28 × 28 patches (the red
rectangle indicates the central 5 × 5 region). Center:
groundtruth 5 × 5 patches. Right: output 5 × 5 patches.


Our CNN was implemented in unoptimized Matlab code. The training
took ≈ 5 hours for the 5 × 5 pixel structured output (700
iterations) given an input of 500K+ patches, and consumed ≈ 6 GB
of memory on an Intel Quad-Core 2.4 GHz PC.

2.3. Structured Labelling Shadow Edges 

Given an input image, the Canny edge detector is applied to find
the significant edges in the image. Then, we extract windows
along the image edges. The overlapping edge patches are then fed
to the proposed CNN for labelling. The trained structured CNN
differentiates between the shadow and reflectance edges and
predicts the shadow edge structure of the central region. Our
structured CNN achieves robust results by combining the
predictions of neighbouring patches. Instead of independently
assigning a class label to each pixel, our structured labels
predict the interactions among the neighbouring pixels of a local
patch. Each pixel collects class hypotheses from the structured
labels predicted for itself and neighbouring pixels. We employ a
simple voting scheme to combine the multiple predictions at each
pixel on the image edges.
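
One way such a voting scheme could look is sketched below. The text does not specify the exact rule, so this example assumes plain averaging of all overlapping binary predictions that cover a pixel; the function name and data layout are hypothetical.

```python
def vote_edge_labels(height, width, predictions):
    """Combine overlapping structured predictions into a per-pixel score.

    predictions: list of ((row, col), patch), where patch is an s x s list
    of 0/1 labels centred on the edge pixel (row, col).
    Returns a height x width list holding, for each pixel, the fraction of
    covering patches that voted "shadow edge" (0.0 where nothing voted).
    """
    votes = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for (cy, cx), patch in predictions:
        half = len(patch) // 2
        for dy, row in enumerate(patch):
            for dx, label in enumerate(row):
                y, x = cy + dy - half, cx + dx - half
                if 0 <= y < height and 0 <= x < width:
                    votes[y][x] += label
                    counts[y][x] += 1
    return [
        [votes[y][x] / counts[y][x] if counts[y][x] else 0.0
         for x in range(width)]
        for y in range(height)
    ]
```

A final binary edge map could then be obtained by thresholding the score, for example at 0.5 for a majority vote; the threshold is likewise an assumption.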

Fig. 3 illustrates the advantage of shadow edge detection with a
structured output CNN. As can be seen, the proposed structured
CNN recovers better local edge structures (local consistency) and
avoids assigning implausible label transitions.

3. Shadow optimization 

We first derive the local and global shadow/bright measures to
model the interactions among the regions across the image. Then,
we present our optimization framework to solve for the shadow
map.

Figure 3: Structured shadow edge detection results. (a) Input
image. We compare the detection results using (b) a 1 × 1
labelling CNN (previous method) and (c) a 5 × 5 structured
labelling CNN. (d) Zoom-in of the green and blue patches; top to
bottom: original, 1 × 1 and 5 × 5 outputs. The 5 × 5 structured
CNN is able to learn fine shadow details, while the 1 × 1 output
has serious spurious noise.

3.1. Global and local shadow(bright) probability 

We observe that both shadow and bright regions have the following
characteristic in their spatial layout: shadow regions are much
more connected to shadow boundaries, and bright regions are more
connected to bright boundaries. We define the dark side region of
a shadow edge as the shadow boundary, and the bright side region
as the bright boundary. In Fig. 4, we illustrate a tree and its
shadow. The blue and pink regions are the shadow and bright
boundaries, respectively. The grey region is clearly a shadow
region as it significantly touches the shadow boundary, while the
white region is clearly a bright region as it largely touches the
bright boundary.

The geodesic distance between any two patches p, q in an image is
defined as the accumulated edge weights along their shortest path
on the graph:

$$d_{geo}(p, q) = \min_{p_1 = p,\, p_2, \ldots, p_n = q} \sum_{i=1}^{n-1} d_{app}(p_i, p_{i+1}),$$

where $d_{app}(p, q)$ is the Euclidean distance between the
average colours of the two patches. The normalized
$D(p, q) \in (0, 1]$ is thus defined as
$D(p, q) = \exp\left(-\frac{d_{geo}^2(p, q)}{2\sigma_{clr}^2}\right)$,
which characterizes how much patch p connects (or contributes) to
patch q. $D(p, q) \approx 0$ when $d_{geo}(p, q) \gg \sigma_{clr}$.

Figure 4: An illustrative example. The orange lines are shadow
edges. The blue and pink regions adjacent to the shadow edges are
the shadow and bright boundaries, respectively.

Let shd be the set of shadow boundary patches, and lit be the set
of bright boundary patches. Following the boundary connectivity
introduced in [26], we formulate the shadow and bright boundary
connectivities of a patch p as:

$$con_{shd}(p) = \frac{Len(p, shd)}{\sqrt{Area(p)}} \quad \text{and} \quad con_{lit}(p) = \frac{Len(p, lit)}{\sqrt{Area(p)}}, \tag{1}$$

respectively. $Len(p, x) = \sum_{q \in x} D(p, q)$ is the
connectivity of p to boundary x. $Area(p) = \sum_{i=1}^{N} D(p, p_i)$
is the spanning area of p, where N is the number of patches in
the image. Note that we set
$d_{app}(p \in shd, q \in lit) = \infty$, which implies that the
bright and shadow boundary sets are not connected, and no path
can cut through the shadow edges. We can see that
$con_{shd/lit}(p)$ quantifies how heavily a patch p is connected
to the shadow/bright boundaries in a local area
($\sigma_{clr} = 5$ in the experiments).
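
To make the connectivity definition concrete, the sketch below computes geodesic distances on a small patch graph and then the boundary connectivity Len / sqrt(Area). It is an illustration under stated assumptions: the Gaussian form of D is taken from the boundary connectivity formulation of [26], all-pairs shortest paths are computed with Floyd-Warshall for brevity (any shortest-path algorithm would do), and the function names are hypothetical.

```python
import math


def geodesic_distances(d_app):
    """All-pairs shortest-path distances (Floyd-Warshall).

    d_app[i][j] is the appearance distance between adjacent patches i and j,
    or math.inf if they are not adjacent (or are separated by a shadow edge).
    """
    n = len(d_app)
    d = [[0.0 if i == j else d_app[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def boundary_connectivity(d_app, boundary, sigma=5.0):
    """con(p) = Len(p, boundary) / sqrt(Area(p)) for every patch p.

    Assumption: D(p, q) = exp(-d_geo(p, q)^2 / (2 sigma^2)).
    """
    d_geo = geodesic_distances(d_app)
    n = len(d_geo)
    D = [[math.exp(-(d_geo[i][j] ** 2) / (2.0 * sigma ** 2)) for j in range(n)]
         for i in range(n)]
    con = []
    for p in range(n):
        length = sum(D[p][q] for q in boundary)   # Len(p, boundary)
        area = sum(D[p])                          # Area(p), includes D(p, p) = 1
        con.append(length / math.sqrt(area))
    return con
```

For example, with four patches where patches 0 and 1 are adjacent, patches 2 and 3 are adjacent, and the two groups are separated by an infinite weight, calling `boundary_connectivity(d_app, boundary=[0])` gives zero connectivity for patches 2 and 3, since no finite path reaches the boundary patch.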

We define the local shadow/bright measure from these
connectivities as:

$$\gamma_{shd,lit}(p) = \exp\left(-\frac{con_{shd,lit}^2(p)}{2\sigma_{con}^2}\right), \tag{2}$$

and the local shadow/bright probability can be computed as
$Pr^{loc}_{shd,lit}(p) = 1 - \gamma_{shd,lit}(p)$, which is close
to 1 when the shadow/bright connectivity is large, and 0 when it
is small. Note that $Pr^{loc}(p)$ is only affected by the local
shadow edges in the region that p belongs to. If a patch is
hardly connected to any of the shadow edges in the image,
$Pr^{loc}(p)$ is low, which indicates that we cannot get a
correct prediction with the local information.

We define the global shadow/bright measure as:

$$\Gamma_{shd,lit}(p) = \sum_{i=1}^{N} w_{app}(p, p_i)\, w_{spa}(p, p_i)\, \gamma_{shd,lit}(p_i), \tag{3}$$

where $w_{app} = \exp\left(-\frac{d_{app}^2}{2\sigma_{app}^2}\right)$
and $w_{spa} = \exp\left(-\frac{d_{spa}^2}{2\sigma_{spa}^2}\right)$.
$d_{app}$ and $d_{spa}$ are the Euclidean distances between the
average colors and the locations, respectively. It is based on
the observation that if two patches in an image have the same
colour and are near to each other, they usually are both in
shadow or both in bright regions. The global shadow/bright
probability at p can be computed as
$Pr^{glo}_{shd,lit}(p) = 1 - \Gamma_{shd,lit}(p)$.

3.2. Global shadow optimization

The input image is first abstracted as a set of nearly regular
superpixels using the Quick Shift segmentation method [23]. The
shadow and bright boundary patches can be extracted from the
local shadow edge detection results. We only select the most
reliable patches as the shd and lit boundary regions.
Specifically, if a superpixel at a shadow edge is darker
(brighter) than all the adjacent superpixels, we set it as a
shadow (bright) boundary patch. If a superpixel is brighter than
some of its neighbours but darker than some others, we consider
the patch to be ambiguous and discard it.

After obtaining shd and lit, the local shadow/bright measures
$\gamma_{shd,lit}$ can be computed at each superpixel, as
described in Eq. 2. The global shadow/bright measures
$\Gamma_{shd,lit}(p)$ can then be computed as in Eq. 3.

We formulate the shadow detection problem as the optimization of
the shadow values of all image superpixels. The objective cost
function is designed to assign the shadow regions value 1 and the
bright regions value 0, respectively. The optimal shadow map is
then obtained by minimizing the cost function. Let the shadow
values of N superpixels be $\{s_i\}_{i=1}^{N}$. The cost function
is thus defined as

$$e = \sum_{i=1}^{N} w_i^{shd} s_i^2 + \sum_{i=1}^{N} w_i^{lit} (s_i - 1)^2 + \sum_{i,j} w_{ij} (s_i - s_j)^2 + \lambda \sum_{i \in shd, lit} (s_i - \bar{s}_i)^2, \tag{4}$$

where $w_i^{shd,lit} = \Gamma_{shd,lit}(p_i)$ as defined in Eq. 3.
$\bar{s}_i$ is the initial value for $p_i \in \{lit, shd\}$,
where $\bar{s}_i = 1$ for $p_i \in shd$ and 0 for $p_i \in lit$.
We set $\lambda = 0.001$ in the experiments.

The four terms are all squared errors and the optimal shadow map
is computed by least-square. Fig. 5 shows the optimized results.
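
Because every term of such a cost is a squared error, setting its gradient to zero gives a linear system. The sketch below illustrates this for an assumed cost of the form sum_i a_i s_i^2 + sum_i b_i (s_i - 1)^2 + sum over neighbouring pairs w_ij (s_i - s_j)^2 + lambda * sum over anchored patches (s_i - sbar_i)^2, where a_i pulls a patch toward 0 (bright) and b_i pulls it toward 1 (shadow). The exact weights used in the paper are not reproduced here, the parameter names are hypothetical, and Gauss-Seidel sweeps are used only to keep the example dependency-free; a direct linear solve reaches the same minimizer.

```python
def solve_shadow_map(pull_to_bright, pull_to_shadow, smooth,
                     anchors=None, lam=0.001, sweeps=200):
    """Minimise a four-term quadratic cost over shadow values s_i.

    pull_to_bright[i]: weight a_i on s_i^2         (pushes s_i toward 0)
    pull_to_shadow[i]: weight b_i on (s_i - 1)^2   (pushes s_i toward 1)
    smooth: dict {(i, j): w_ij}, one entry per neighbouring pair
    anchors: dict {i: initial value} for boundary patches (1 shd, 0 lit)

    Zeroing the gradient for s_i gives the update
        s_i = (b_i + lam*sbar_i + sum_j w_ij s_j)
              / (a_i + b_i + lam + sum_j w_ij),
    with the lam terms present only for anchored patches.
    """
    n = len(pull_to_bright)
    anchors = anchors or {}
    neighbours = [[] for _ in range(n)]
    for (i, j), w in smooth.items():
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))

    s = [0.5] * n
    for _ in range(sweeps):
        for i in range(n):
            num = pull_to_shadow[i]
            den = pull_to_bright[i] + pull_to_shadow[i]
            if i in anchors:
                num += lam * anchors[i]
                den += lam
            for j, w in neighbours[i]:
                num += w * s[j]
                den += w
            if den > 0:          # an unconstrained patch keeps its value
                s[i] = num / den
    return s
```

As a small check, a chain of three patches with `pull_to_bright = [1, 0, 0]`, `pull_to_shadow = [0, 0, 1]` and unit smoothness weights on the pairs (0, 1) and (1, 2) converges to values near 0.25, 0.5 and 0.75, which is the smooth interpolation one would expect between a bright end and a shadow end.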

4. Experiments 

4.1. Datasets 

UCF Shadow Dataset: This dataset contains 355 images with
manually labeled region-based ground truth. Only 245 of the 355
images were used in [25, 6]. The split of the train/test data
follows the software package provided by [6], as the original
authors did not disclose the split.

CMU Shadow Dataset: This dataset contains 135 images with
manually labeled edge-based ground truth for shadows on the
ground. As our algorithm is not restricted to ground shadows, we
create the ground plane masks and augment the edge-based ground
truth to region-based ground truth. The authors did not report a
train/test data split, therefore we follow the procedure in [12]
where even-numbered images are used for training and odd-numbered
images for testing.

UIUC Shadow Dataset: This dataset contains 108 images (32 train
images and 76 test images) with region-based ground truth.

4.2. Results 

Figure 5: Global shadow optimization. (a) Input images with
superpixel boundaries overlaid. (b) Shadow and bright boundary
patches extracted from the locally detected shadow edges. (c)
Local shadow (L) and bright (R) measures in Eq. 2. (d) Global
shadow (L) and bright (R) measures in Eq. 3; the predictions are
propagated over the image. (e) Optimized shadow maps obtained by
minimizing Eq. 4.

Figure 8: Comparison with Zhu's work [25]. Left: input image.
Middle: Zhu's results. Right: our result.

We extensively evaluated our proposed algorithm on three publicly
available single image shadow datasets. The evaluation results on
UCF in [12] were based on the full dataset. To be comparable to
their results, we report the UCF results using both the full
dataset and the subset of 245 images. As shown in Table 1, our
shadow detection method (SCNN-LinearOpt) achieves the best
performance on all three datasets. In particular, we achieve
gains of almost 2% and 5% over state-of-the-art results on the
UCF and CMU datasets. Table 2 shows the comparisons of
class-specific detection accuracies. We take the shadow accuracy
to be the number of pixels correctly detected as shadow divided
by the total number of pixels marked as shadow in the ground
truth. Non-shadow accuracy is obtained in a similar manner. Our
approach achieves significantly higher shadow accuracies. This is
consistent with the finding from Fig. 7, where our approach
delivers the highest AUC.
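
The class-specific accuracies defined above amount to per-class recall. A minimal sketch, assuming the prediction and ground truth are given as 2-D lists of 0/1 values (the function name is illustrative):

```python
def class_accuracies(pred, gt):
    """Return (shadow accuracy, non-shadow accuracy).

    Shadow accuracy: pixels correctly detected as shadow divided by all
    ground-truth shadow pixels; non-shadow accuracy is defined analogously.
    Raises ZeroDivisionError if a class is absent from the ground truth.
    """
    pairs = [(bool(p), bool(g))
             for pred_row, gt_row in zip(pred, gt)
             for p, g in zip(pred_row, gt_row)]
    n_shadow = sum(g for _, g in pairs)
    n_non_shadow = len(pairs) - n_shadow
    hit_shadow = sum(p and g for p, g in pairs)
    hit_non_shadow = sum((not p) and (not g) for p, g in pairs)
    return hit_shadow / n_shadow, hit_non_shadow / n_non_shadow
```

For a 2 × 2 ground truth `[[1, 1], [0, 0]]` and prediction `[[1, 0], [0, 0]]`, one of the two shadow pixels is found and both non-shadow pixels are correct.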

Fig. 6 shows some of the qualitative results obtained with our
method. The results suggest that our shadow detector performs
robustly under various cases ranging from indoor images to
outdoor and aerial images that exhibit soft shadows, low light
conditions, colour casts, and complex self-shading regions. In
Fig. 8, we compare our approach with Zhu's work [25]. Our method
can correctly recover the shadow regions in the complex scene. In
Fig. 9, we show that our approach outperforms Guo's work [6] in
the ambiguous situation where the object material has a colour
similar to that of the shadows in the image. In Fig. 10, we
compare our shadow edge results with Lalonde's results [15]. Our
method can accurately detect the shadow edges in images on which
Lalonde's method fails. We also compare our method with Khan et
al.'s very recent work [12] in Fig. 11.

5. Conclusions 

In this paper, we propose an efficient structured labelling
framework for shadow detection from a single image. We show that
the structured CNN framework can capture the local structure
information of shadow edges. Moreover, we present novel global
shadow/bright measures to model the complex global interactions
based on the spatial layout of image regions. The non-local
constraints on shadow/bright regions help to overcome ambiguities
in local inference. Using these non-local region constraints, we
formulate shadow detection as a least-square optimization problem
which can be solved efficiently. Our method can be easily
extended to other low-level problems such as object edge
detection and smoke region detection.

Methods               UCF dataset       UIUC dataset   CMU dataset
BDT-BCRF [25]         88.7%             -              -
BDT-CRF-Scene [15]    -                 -              84.8%
Unary-Pairwise [6]    90.2%             89.1%          -
CNN-CRF [12]          90.7%*            93.2%          88.8%
SCNN-LinearOpt        93.1% (92.3%*)    93.4%          94.0%

Table 1: Performance Comparisons of Shadow Detection Methods

Datasets\Methods        Shadows           Non-Shadow
UCF Dataset
  BDT-BCRF [25]         63.9%             93.4%
  Unary-Pairwise [6]    73.3%             93.7%
  CNN-CRF [12]          78.0%*            92.6%*
  SCNN-LinearOpt        91.1% (91.6%*)    93.5% (93.4%*)
UIUC Dataset
  Unary-Pairwise [6]    71.6%             95.2%
  CNN-CRF [12]          84.7%             95.5%
  SCNN-LinearOpt        91.3%             95.03%
CMU Dataset
  BDT-CRF-Scene [15]    73.1%             96.4%
  CNN-CRF [12]          83.3%             90.9%
  SCNN-LinearOpt        91.6%             97.7%

Table 2: Pixel-wise Shadow/Non-Shadow Detection Accuracy. *: The
result is produced using the full dataset.

Figure 6: Shadow optimization results from detected shadow edges.
Top: input shadow images. Bottom: recovered shadow regions.

Figure 9: Comparison with Guo's work [6]. Left: input image.
Middle: Guo's results. Right: our result. Our method can
correctly recover the shadow regions.

Figure 10: Shadow edge detection results compared with Lalonde's
work [15]. Left: input image. Middle: Lalonde's results. Right:
our result. Our method can accurately detect the shadow edges.